import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from IPython.display import Image
from IPython.core.display import HTML
import numpy as np
import pandas as pd
from itertools import permutations
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import pprint
pp = pprint.PrettyPrinter()
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, KFold, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from xgboost import XGBClassifier, plot_importance
1. Introduction and overview of the problem
The data set we will be looking at describes various aspects of development for 160 countries around the world. The fictitious scenario is that we are an NGO that needs to decide how to allocate its funding according to the data it has on the countries. Since this is an unlabeled data set, we would not be able to compare the performance of any unsupervised algorithms without bringing in a set of labels.
Here is the twist I'm adding to the scenario: As the leaders of the NGO, we have decided that the features in our data set are the correct ones to use in allocating our funds, but we'd like to see if the United Nations has existing classifications and designations that align with our chosen features. To determine this, we will join in data from two lists. The first is the United Nations World Economic Situation and Prospects (WESP) designation, the commonly referenced list of developed, developing and in-transition countries. The second is the United Nations Development Programme's Human Development Index (HDI), which rates countries as Very High, High, Medium and Low on a development scale measuring health, knowledge and standard-of-living dimensions.
The approach for this project is to take the unlabeled country data and use K-Means and Hierarchical Clustering to identify clusters in the data. We will then use the generated clusters as classes, and compare them to the two UN classifications. Finally, we will use XGBoost to build a classic supervised model trained on the two UN classifications.
The goal is not to see if we can replicate the UN classifications from our data, but rather to see if the patterns in our data lead to similar classification as the UN lists. If our models generate similar results from our data set, then we can conclude that the UN uses similar data to derive their classifications. What is more likely is that we will discover that our data does not lead to the same classification as the UN, either because our model is different, our data is different, or the UN applies some subjective logic to classifying countries.

2. Description of the data sources
Country data. This data is available in a few places on Kaggle, but I retrieved it from here: https://github.com/pycaret/pycaret/blob/master/datasets/country-data.csv. It contains data on the human living conditions in each country like child mortality rates, life expectancy, percent of GDP spent on healthcare, inflation, and others.
The UN WESP data was retrieved from: https://www.un.org/en/development/desa/policy/wesp/wesp_current/2014wesp_country_classification.pdf. It contains, among other things, classifications of countries into Developing, In Transition and Developed.
The UN Human Development Index data was retrieved from: https://hdr.undp.org/data-center/human-development-index#/indicies/HDI. It categorizes countries into Very High, High, Medium and Low levels on the HDI scale.
3. Data cleaning
I was hoping to be able to simply join the three data sets together based on country name, but upon inspection this proved not to be possible. The country names were not reliably the same across the three lists. One discrepancy is the use of official vs colloquial country names, for example: South Korea vs Republic of Korea, North Korea vs Democratic People's Republic of Korea, or Venezuela vs Bolivarian Republic of Venezuela. Other discrepancies arose from the inconsistent use of diacritics and other special characters, for example Turkey vs Türkiye. In the end I manually merged the three lists. Luckily there were only 160 countries in the resulting merged list.
Other than that, the data is clean. There is no missing data to be imputed or duplicate rows to be resolved.
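Had I wanted to automate more of the merge, a fuzzy-matching pass could at least have flagged candidate matches for manual review. Here is a minimal sketch using the standard library's difflib; the name lists and the 0.6 cutoff are illustrative, not taken from the actual source files:

```python
import difflib

# Names as they might appear in two of the source lists (illustrative only).
country_names_a = ['South Korea', 'Venezuela', 'Turkey', 'Albania']
country_names_b = ['Republic of Korea', 'Bolivarian Republic of Venezuela',
                   'Türkiye', 'Albania']

def match_country(name, candidates, cutoff=0.6):
    """Return the closest candidate name, or None if nothing clears the cutoff."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for name in country_names_a:
    print(f'{name!r} -> {match_country(name, country_names_b)!r}')
```

Note that long official names like "Bolivarian Republic of Venezuela" score poorly against their colloquial forms under simple string similarity, which is part of why the manual merge was still necessary.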
country_data = pd.read_csv('data/Country-data_augmented.csv')
country_data
| | Country | Child Mortality | Exports | Health | Imports | Income | Inflation | Life Expectancy | Fertility | GDPP | WESP Classification | HDI Classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 | Developing | Low |
| 1 | Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 | In transition | High |
| 2 | Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 | Developing | High |
| 3 | Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 | Developing | Medium |
| 4 | Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 | Developing | High |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 155 | Vanuatu | 29.2 | 46.6 | 5.25 | 52.7 | 2950 | 2.62 | 63.0 | 3.50 | 2970 | Developing | Medium |
| 156 | Venezuela | 17.1 | 28.5 | 4.91 | 17.6 | 16500 | 45.90 | 75.4 | 2.47 | 13500 | Developing | Medium |
| 157 | Vietnam | 23.3 | 72.0 | 6.84 | 80.2 | 4490 | 12.10 | 73.1 | 1.95 | 1310 | Developing | High |
| 158 | Yemen | 56.3 | 30.0 | 5.18 | 34.4 | 4480 | 23.60 | 67.5 | 4.67 | 1310 | Developing | Low |
| 159 | Zambia | 83.1 | 37.0 | 5.89 | 30.9 | 3280 | 14.00 | 52.0 | 5.40 | 1460 | Developing | Medium |
160 rows × 12 columns
# Split dependent and independent vars, label encode the categorical targets.
country_names = country_data['Country']
X = country_data.drop(['Country', 'WESP Classification', 'HDI Classification'], axis=1)
y = country_data[['WESP Classification', 'HDI Classification']].copy()
le_3 = LabelEncoder()
le_4 = LabelEncoder()
y['wesp_label'] = le_3.fit_transform(y['WESP Classification'])
y['hdi_label'] = le_4.fit_transform(y['HDI Classification'])
print(f'{country_data.shape=}')
print(f'{country_names.shape=}')
print(f'{X.shape=}')
print(f'{y.shape=}')
country_data.shape=(160, 12)
country_names.shape=(160,)
X.shape=(160, 9)
y.shape=(160, 4)
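One detail worth noting: LabelEncoder assigns integer codes in sorted order of the class names, which is why the category lists defined later for evaluation (categories_3 and categories_4) are written alphabetically. A small self-contained illustration:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Developing', 'Developed', 'In transition', 'Developing'])

# classes_ is stored sorted, so 'Developed' -> 0, 'Developing' -> 1, 'In transition' -> 2.
print(list(le.classes_))  # ['Developed', 'Developing', 'In transition']
print(list(codes))        # [1, 0, 2, 1]
```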
# Look for missing data
country_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160 entries, 0 to 159
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Country              160 non-null    object
 1   Child Mortality      160 non-null    float64
 2   Exports              160 non-null    float64
 3   Health               160 non-null    float64
 4   Imports              160 non-null    float64
 5   Income               160 non-null    int64
 6   Inflation            160 non-null    float64
 7   Life Expectancy      160 non-null    float64
 8   Fertility            160 non-null    float64
 9   GDPP                 160 non-null    int64
 10  WESP Classification  160 non-null    object
 11  HDI Classification   160 non-null    object
dtypes: float64(7), int64(2), object(3)
memory usage: 15.1+ KB
# Look for duplicate country names.
country_data['Country'].nunique()
160
4. Exploratory data analysis
Looking at the distribution of country labels in the WESP classification, we see that there are many developing countries and few in-transition countries. The definition of in-transition is a bit vague, and the UN documentation admits that there is quite a bit of overlap between the three categories. The histogram shows a more even distribution among the HDI classes.
The pairplots for WESP reveal that this could be a challenging clustering problem. The developed and in-transition countries are fairly well grouped, but the developing countries cover a wide range of values in almost all the dimensions, making it hard to separate the three groups.
The pairplots for HDI reveal a similar pattern, with definite concentrations appearing, but no distinct clusters. I think in the end we will find that this data is quite continuous and that classification is more a matter of drawing arbitrary thresholds than identifying completely distinct categories of countries, i.e., we might arbitrarily say that countries with a score of 1-2 are Low, 3-5 are Medium, 6-8 are High and 9-10 are Very High.
Looking at the boxplots for individual features, separated by class, reveals similar insights. None of the features is cleanly separable by class; there is a lot of overlap. For the WESP classes, Developed seems fairly distinct, while In Transition and Developing largely overlap. For the HDI classes, there seems to be a nice linear progression from Low to Very High, which suggests those classes were derived from data similar to ours.
My intuition is that we will have reasonable success at aligning our clusters with the UN data, with more success on the HDI classes than the WESP classes.
wesp_cat_order = {'WESP Classification': ['Developing', 'In transition', 'Developed']}
hdi_cat_order = {'HDI Classification': ['Low', 'Medium', 'High', 'Very High']}
fig = px.histogram(country_data, x='WESP Classification', histnorm='probability density', category_orders=wesp_cat_order)
fig.show()
fig = px.histogram(country_data, x='HDI Classification', histnorm='probability density', category_orders=hdi_cat_order)
fig.show()
sns.pairplot(country_data, hue='WESP Classification');
sns.pairplot(country_data, hue='HDI Classification');
for col in X.columns:
    fig = px.box(country_data, x="WESP Classification", y=col, category_orders=wesp_cat_order)
    fig.show()
for col in X.columns:
    fig = px.box(country_data, x="HDI Classification", y=col, category_orders=hdi_cat_order)
    fig.show()
5. Modeling
We start with a K-Means model, applying the elbow method to determine an appropriate K. From the chart, any value between 3 and 5 would be reasonable. This is fortunate, since the two sets of labels we have contain 3 and 4 labels. We will continue the rest of the analysis using both the 3-category WESP labels and the 4-category HDI labels.
We then fit hierarchical clustering models with 3 and 4 clusters, to compare against the WESP and HDI labels respectively.
Finally we fit an XGBoost model to see how the two unsupervised models above compare to a supervised model.
This gives us a total of 6 models (3 types of model times 2 sets of labels). We compute weighted F1 scores and display a confusion matrix for each model and will discuss the results in the following section.
# Scaling
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_scaled.describe()
| | Child Mortality | Exports | Health | Imports | Income | Inflation | Life Expectancy | Fertility | GDPP |
|---|---|---|---|---|---|---|---|---|---|
| count | 160.000000 | 160.000000 | 160.000000 | 160.000000 | 160.000000 | 160.000000 | 160.000000 | 160.000000 | 160.000000 |
| mean | 0.176552 | 0.206024 | 0.310938 | 0.265554 | 0.135562 | 0.112954 | 0.757643 | 0.287549 | 0.123879 |
| std | 0.199770 | 0.139603 | 0.169209 | 0.140476 | 0.157512 | 0.099178 | 0.178836 | 0.242382 | 0.177567 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.025073 | 0.117269 | 0.193599 | 0.171094 | 0.021714 | 0.056210 | 0.638067 | 0.100552 | 0.010299 |
| 50% | 0.081061 | 0.175551 | 0.280609 | 0.246554 | 0.079113 | 0.089363 | 0.811637 | 0.198738 | 0.044087 |
| 75% | 0.290652 | 0.256595 | 0.426352 | 0.331787 | 0.185029 | 0.141715 | 0.884615 | 0.468454 | 0.140490 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
# KMeans elbow method
ssd = []
for k in range(1, 11):
    # n_init pinned to the historical default for stable results across sklearn versions
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    ssd.append(km.inertia_)
plt.plot(range(1, 11), ssd, 'bx-')
plt.xlabel('K')
plt.ylabel('Sum of squared distances')
plt.title('K vs SSD')
plt.show()
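As a complementary check on the elbow method (not part of the original analysis), silhouette scores measure cluster cohesion versus separation for each K; higher is better. A sketch on stand-in data with the same shape as X_scaled (random data here, so the actual score values are not meaningful):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X_demo = rng.random((160, 9))  # stand-in for X_scaled: 160 rows, 9 scaled features

sil_scores = {}
for k in range(2, 6):  # silhouette_score requires at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    sil_scores[k] = silhouette_score(X_demo, labels)
    print(f'k={k}: silhouette={sil_scores[k]:.3f}')
```

On the real X_scaled, one would pick the K with the highest silhouette and sanity-check it against the elbow chart.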
# KMeans predictions for 3 and 4 clusters
km_3 = KMeans(n_clusters=3, n_init=10, random_state=42)
y_pred = pd.DataFrame(km_3.fit_predict(X_scaled), columns=['km_3'])
km_4 = KMeans(n_clusters=4, n_init=10, random_state=42)
y_pred['km_4'] = km_4.fit_predict(X_scaled)
# Hierarchical clustering predictions for 3 and 4 clusters
hc_3 = AgglomerativeClustering(n_clusters=3, compute_distances=True).fit(X_scaled)
y_pred['hc_3'] = hc_3.labels_
hc_4 = AgglomerativeClustering(n_clusters=4, compute_distances=True).fit(X_scaled)
y_pred['hc_4'] = hc_4.labels_
# Source: https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html
def plot_dendrogram(model, **kwargs):
    # Create the linkage matrix and then plot the dendrogram.
    # First, count the samples under each node.
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
plot_dendrogram(hc_3, p=3)
plot_dendrogram(hc_4, p=4)
def label_permute_compare(y_true, y_pred, categories):
    """Find the cluster-to-category assignment that maximizes the weighted F1 score."""
    best_score = 0
    best_labels = None
    n = len(categories)
    for label_order in permutations(range(n), n):
        cats = {categories[i]: label_order[i] for i in range(n)}
        label_candidate = [cats[x] for x in y_true]
        score = f1_score(label_candidate, y_pred, average='weighted')
        if score > best_score:
            best_score = score
            best_labels = label_order
    return best_labels, best_score
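The brute-force permutation search above is fine for 3 or 4 categories, but it grows factorially with the number of clusters. For larger label sets, the best cluster-to-class assignment can be found in polynomial time with the Hungarian algorithm via scipy.optimize.linear_sum_assignment. A sketch (this version maximizes raw agreement rather than weighted F1, which is a simplification):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def best_label_mapping(y_true_codes, y_pred_clusters):
    """Map cluster ids to class codes so that total agreement is maximized."""
    cm = confusion_matrix(y_true_codes, y_pred_clusters)
    # Maximizing the trace of the permuted confusion matrix is a
    # linear assignment problem; negate because the solver minimizes.
    class_ind, cluster_ind = linear_sum_assignment(-cm)
    return {cluster: cls for cls, cluster in zip(class_ind, cluster_ind)}

# Toy example: clusters 0/1/2 actually correspond to classes 1/2/0.
y_true = np.array([1, 1, 2, 2, 0, 0])
clusters = np.array([0, 0, 1, 1, 2, 2])
mapping = best_label_mapping(y_true, clusters)
relabeled = np.array([mapping[c] for c in clusters])
print(mapping)                       # {2: 0, 0: 1, 1: 2}
print((relabeled == y_true).mean())  # 1.0 on this toy example
```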
# Performance evaluation
categories_3 = ['Developed', 'Developing', 'In transition']
categories_4 = ['High', 'Low', 'Medium', 'Very High']
results = {}
# K-Means performance for 3-category WESP labels.
results['km_3'] = label_permute_compare(y['WESP Classification'], y_pred['km_3'], categories_3)
y_pred['km_3_classes'] = [categories_3[results['km_3'][0].index(i)] for i in y_pred['km_3']]
results['km_3']
((2, 0, 1), 0.6113377980300101)
print(classification_report(y['WESP Classification'], y_pred['km_3_classes']))
precision recall f1-score support
Developed 0.79 0.73 0.76 37
Developing 1.00 0.43 0.60 107
In transition 0.20 1.00 0.33 16
accuracy 0.56 160
macro avg 0.66 0.72 0.57 160
weighted avg 0.87 0.56 0.61 160
cm = confusion_matrix(y['WESP Classification'], y_pred['km_3_classes'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=categories_3)
disp.plot()
plt.show()
# Hierarchical clustering performance for 3-category WESP labels.
results['hc_3'] = label_permute_compare(y['WESP Classification'], y_pred['hc_3'], categories_3)
y_pred['hc_3_classes'] = [categories_3[results['hc_3'][0].index(i)] for i in y_pred['hc_3']]
results['hc_3']
((0, 2, 1), 0.5584425625920472)
print(classification_report(y['WESP Classification'], y_pred['hc_3_classes']))
precision recall f1-score support
Developed 0.73 0.65 0.69 37
Developing 0.67 0.54 0.60 107
In transition 0.00 0.00 0.00 16
accuracy 0.51 160
macro avg 0.46 0.40 0.43 160
weighted avg 0.61 0.51 0.56 160
cm = confusion_matrix(y['WESP Classification'], y_pred['hc_3_classes'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=categories_3)
disp.plot()
plt.show()
# K-Means cluster performance for 4-category HDI labels.
results['km_4'] = label_permute_compare(y['HDI Classification'], y_pred['km_4'], categories_4)
y_pred['km_4_classes'] = [categories_4[results['km_4'][0].index(i)] for i in y_pred['km_4']]
results['km_4']
((1, 0, 3, 2), 0.5241771515945828)
print(classification_report(y['HDI Classification'], y_pred['km_4_classes']))
precision recall f1-score support
High 0.54 0.93 0.68 40
Low 0.63 1.00 0.77 29
Medium 0.00 0.00 0.00 31
Very High 1.00 0.40 0.57 60
accuracy 0.56 160
macro avg 0.54 0.58 0.51 160
weighted avg 0.62 0.56 0.52 160
cm = confusion_matrix(y['HDI Classification'], y_pred['km_4_classes'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=categories_4)
disp.plot()
plt.show()
# Hierarchical clustering performance for 4-category HDI labels.
results['hc_4'] = label_permute_compare(y['HDI Classification'], y_pred['hc_4'], categories_4)
y_pred['hc_4_classes'] = [categories_4[results['hc_4'][0].index(i)] for i in y_pred['hc_4']]
results['hc_4']
((0, 1, 3, 2), 0.5370577998402374)
print(classification_report(y['HDI Classification'], y_pred['hc_4_classes']))
precision recall f1-score support
High 0.45 0.97 0.61 40
Low 0.68 0.93 0.78 29
Medium 0.00 0.00 0.00 31
Very High 0.97 0.48 0.64 60
accuracy 0.59 160
macro avg 0.52 0.60 0.51 160
weighted avg 0.60 0.59 0.54 160
cm = confusion_matrix(y['HDI Classification'], y_pred['hc_4_classes'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=categories_4)
disp.plot()
plt.show()
# Supervised training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=24)
xgb_3_scores = cross_val_score(XGBClassifier(random_state=42), X, y['wesp_label'], scoring='f1_weighted', cv=5)
print(f'Average 3-class xgb F1 score: {np.mean(xgb_3_scores):.3f}')
Average 3-class xgb F1 score: 0.881
xgb_3 = XGBClassifier(random_state=42)
xgb_3.fit(X_train, y_train['wesp_label'])
y_pred_sup = pd.DataFrame(le_3.inverse_transform(xgb_3.predict(X_test)))
y_pred_sup.columns = ['xgb_3']
print(classification_report(y_test['WESP Classification'], y_pred_sup['xgb_3']))
precision recall f1-score support
Developed 0.89 0.89 0.89 9
Developing 0.86 0.95 0.90 20
In transition 0.00 0.00 0.00 3
accuracy 0.84 32
macro avg 0.58 0.61 0.60 32
weighted avg 0.79 0.84 0.82 32
cm = confusion_matrix(y_test['WESP Classification'], y_pred_sup['xgb_3'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=categories_3)
disp.plot()
plt.show()
xgb_4_scores = cross_val_score(XGBClassifier(random_state=42), X, y['hdi_label'], scoring='f1_weighted', cv=5)
print(f'Average 4-class xgb F1 score: {np.mean(xgb_4_scores):.3f}')
Average 4-class xgb F1 score: 0.754
xgb_4_scores
array([0.76788836, 0.80849359, 0.71458333, 0.72291667, 0.7582651 ])
xgb_4 = XGBClassifier(random_state=42)
xgb_4.fit(X_train, y_train['hdi_label'])
y_pred_sup['xgb_4'] = le_4.inverse_transform(xgb_4.predict(X_test))
print(classification_report(y_test['HDI Classification'], y_pred_sup['xgb_4']))
precision recall f1-score support
High 1.00 0.70 0.82 10
Low 0.80 1.00 0.89 4
Medium 0.67 0.67 0.67 6
Very High 0.86 1.00 0.92 12
accuracy 0.84 32
macro avg 0.83 0.84 0.83 32
weighted avg 0.86 0.84 0.84 32
cm = confusion_matrix(y_test['HDI Classification'], y_pred_sup['xgb_4'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=categories_4)
disp.plot()
plt.show()
6. Results and analysis
The results of the modeling are summarized in the table below.
| | 3-category F1 score | 4-category F1 score |
|---|---|---|
| K-Means | 0.61 | 0.52 |
| Hierarchical Clustering | 0.56 | 0.54 |
| XGBoost | 0.82 | 0.84 |
Looking deeper into the classification reports reveals more interesting insights than the F1 scores alone. For example, on the WESP labels, all three models struggled with the in-transition label, particularly with distinguishing it from the developing label. This aligns with my intuition from the visualizations that there is a lot of overlap between those two groups.
Looking at the HDI labels, both of the unsupervised models had the most success with the Low label and struggled the most with the Medium label. This also aligns with the EDA, which showed that Low HDI countries appeared more separable. Although the numbers in each box are small because we have a small test set spread across a 4x4 grid, the XGBoost model seems to perform quite well on the HDI data, with respectable F1 scores for every category.
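A complementary way to quantify "better-than-random agreement" (not used above, but standard for comparing clusterings to reference labels) is the adjusted Rand index, which is invariant to how clusters are numbered and scores 0 at chance level and 1 for perfect agreement:

```python
from sklearn.metrics import adjusted_rand_score

true_classes = [0, 0, 1, 1, 2, 2]
clusters_a = [1, 1, 2, 2, 0, 0]  # same grouping, different cluster ids
clusters_b = [0, 1, 0, 1, 0, 1]  # grouping unrelated to the classes

print(adjusted_rand_score(true_classes, clusters_a))  # 1.0
print(adjusted_rand_score(true_classes, clusters_b))  # near (or below) 0
```

Applied to, say, y_pred['km_3'] against y['wesp_label'], this would score the agreement directly, with no permutation search needed.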
7. Discussion and conclusion
The question we originally posed was whether the country data we obtained could be clustered or modeled with XGBoost to achieve results similar to the two UN classification systems for countries. The results show that there is better-than-random agreement between the unsupervised models and the UN classes, but it is not very strong. This suggests that (unsurprisingly) the UN uses a different methodology and/or features to derive its country classifications, but that there is some overlap with the data we have used in this analysis. One would hope that the UN also applies expert judgment when designating countries as in-transition, since there does not seem to be a purely data-driven reason for some of these categorizations.